# Merging Dataframes

In this notebook we'll explore how to combine datasets from different files based on the columns (variables) they share. Two files can be combined on one or more common variables, which we refer to as keys. We can then use the pandas `merge` function to create a new dataframe based on the common key variable(s). In simplified form, a call to `merge` looks like this:

     pandas.merge(left_data_frame, right_data_frame, on=<key or list of keys>, how='left' | 'right' | 'inner' | 'outer')
 
We'll run through the operation of the function using different types of merging. We indicate the type of merging with the function option `how`.
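
As a quick preview, the sketch below compares the four `how` options by counting the rows each one produces. The frames `people` and `scores` and their values are hypothetical toy data, not part of this notebook's datasets.

```python
import pandas as pd

# Hypothetical toy frames: keys 'b' and 'c' appear in both,
# 'a' only on the left, 'd' only on the right.
people = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
scores = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Count the rows produced by each merge type.
row_counts = {how: len(pd.merge(people, scores, on='key', how=how))
              for how in ['left', 'right', 'inner', 'outer']}
print(row_counts)
```

An inner merge keeps only the keys found in both frames, an outer merge keeps the keys found in either, and left and right merges keep all keys from one side.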


## 1. Merging example
 
We'll use the example from the [pandas merging documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) to illustrate the `merge` function. Given the dataframes `left` and `right`, we'll merge the two on two keys. When working with more than one key variable, pass the keys as a list: `on=['key1', 'key2']`. When using a single key variable, pass a string instead: `on='key'`.

In [0]:
# pandas must be imported before the dataframes can be built
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                         'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3']})

In [0]:
# display the left dataframe
left

Unnamed: 0,A,B,key1,key2
0,A0,B0,K0,K0
1,A1,B1,K0,K1
2,A2,B2,K1,K0
3,A3,B3,K2,K1


In [0]:
# display the right dataframe
right

Unnamed: 0,C,D,key1,key2
0,C0,D0,K0,K0
1,C1,D1,K1,K0
2,C2,D2,K1,K0
3,C3,D3,K2,K0
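
To see why the choice of keys matters, here is a small sketch that merges the same two frames once on a single key (a string) and once on both keys (a list). It uses the default `how='inner'`; the variable names `on_one` and `on_two` are introduced here for illustration only.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Single key: pass a string. 'key2' exists in both frames but is not a key,
# so pandas keeps both copies and adds the suffixes _x and _y.
on_one = pd.merge(left, right, on='key1')

# Two keys: pass a list. Rows must agree on both columns to match.
on_two = pd.merge(left, right, on=['key1', 'key2'])

print(on_one.columns.tolist())
print(len(on_one), len(on_two))
```

Merging on `key1` alone matches more rows than merging on both keys, because fewer conditions have to hold for two rows to pair up.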


What is the result of a left merge?

In [0]:
# we perform a left merge
df = pd.merge(left, right, on=['key1','key2'], how='left')

In [0]:
# display the merged dataframe
df

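Unnamed: 0,A,B,key1,key2,C,D
0,A0,B0,K0,K0,C0,D0
1,A1,B1,K0,K1,,
2,A2,B2,K1,K0,C1,D1
3,A2,B2,K1,K0,C2,D2
4,A3,B3,K2,K1,,


A left merge keeps every row of the left dataframe. Where a key combination in `left` has no match in `right` (here `K0, K1` and `K2, K1`), the columns that come from `right` (`C` and `D`) are filled with not-a-number (`NaN`) values, shown as empty cells above. The left row with keys `K1, K0` appears twice because it matches two rows in `right`.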
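
One way to check which rows found a match is the `indicator` option of `merge`, which adds a `_merge` column to the result. The following is a minimal sketch on the same `left` and `right` frames; the variable name `merged` is chosen here for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'left_only' for left rows with no partner in the right frame.
merged = pd.merge(left, right, on=['key1', 'key2'], how='left', indicator=True)
print(merged[['key1', 'key2', 'C', '_merge']])

# Count the left rows that had no match (their C and D values are NaN).
print(merged['C'].isna().sum())
```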

A right merge:

In [0]:
# now do a right merge
df = pd.merge(left, right, on=['key1','key2'], how='right')

In [0]:
# display the merged dataframe
df

Unnamed: 0,A,B,key1,key2,C,D
0,A0,B0,K0,K0,C0,D0
1,A2,B2,K1,K0,C1,D1
2,A2,B2,K1,K0,C2,D2
3,,,K2,K0,C3,D3


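A right merge keeps every row of the right dataframe. Where a key combination in `right` has no match in `left` (here `K2, K0`), the columns that come from `left` (`A` and `B`) are filled with `NaN` in the merged dataframe.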
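
Left and right merges mirror each other. As a sketch of that relationship, the code below checks that a right merge of `left` and `right` contains the same data as a left merge with the arguments swapped, once the columns are put in the same order. The names `right_merge`, `swapped` and `same` are introduced here for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Right merge: keep every row of `right`.
right_merge = pd.merge(left, right, on=['key1', 'key2'], how='right')

# Left merge with the frames swapped: also keeps every row of `right`.
swapped = pd.merge(right, left, on=['key1', 'key2'], how='left')

# Only the column order differs, so reorder before comparing.
same = right_merge.equals(swapped[right_merge.columns])
print(same)
```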

### 1.1 Practice

Go ahead and perform inner and outer merges on the datasets. What differences do you notice?

In [0]:
# inner join

In [0]:
# outer join

## 2. Practice

For this exercise you'll perform an inner merge on datasets containing movie information. The datasets are available from [MovieLens Latest Datasets](https://grouplens.org/datasets/movielens/latest/). The files have already been downloaded for you. Unzip the file `ml-latest-small.zip`. Inside you'll find the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. You need to merge `links.csv` and `movies.csv`.

a) Load the necessary libraries. 

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [0]:
# to show plots immediately
%matplotlib inline 

b) Load the files into dataframes.

c) Explore the data. For instance, you could determine the dimensions of the dataframes. Is it helpful to learn such information?
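
As a hint for this step, here is a minimal sketch of a few checks that are useful before merging. The frame `df` and its columns `id` and `name` are hypothetical stand-ins; apply the same calls to the dataframes you loaded from the CSV files.

```python
import pandas as pd

# Hypothetical stand-in frame; replace with one of your loaded dataframes.
df = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['w', 'x', 'y', 'z']})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
print(df.head())   # first few rows

# A candidate key column is easiest to merge on when its values
# are unique and none are missing.
print(df['id'].is_unique, df['id'].isna().sum())
```

Knowing the row counts beforehand lets you compare them with the size of the merged dataframe and spot rows that were dropped or duplicated by the merge.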

d) Decide which variable to use as the key for merging the dataframes, then use an inner merge to create a new dataframe.