# <center> Pandas Merging & Joining Data </center>

- [Simple Joining with Concat Function](#section_1)
- [Complex Joining with Merge Function](#section_2)

<hr>

### Pandas Merging & Joining Data <a class="anchor" id="section_0"></a>

Data professionals often need to combine different data sources for analysis projects. For example, if you are working on a data science project to analyze how sports games impact food and beverage sales, you may need to collect several datasets, such as game timetables, sports teams' performance, sports venues and their capacity, and sales figures for multiple vendors.

If you are using the Pandas library for your project, it's likely that each dataset is stored in a separate DataFrame. Luckily, the library provides a set of tools that allow us to merge and join multiple DataFrames to create large datasets for analysis. 

In this section, we will learn the two most common ways to combine DataFrames in the Pandas library:

* **pd.concat([DataFrame1, DataFrame2]): Simple combination of two or more Pandas DataFrames, either row-wise or column-wise.**

* **pd.merge(DataFrame1, DataFrame2): More complex, column-wise combination of two Pandas DataFrames in a SQL-like way.**
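
As a quick illustration of the difference between the two call styles, here is a minimal sketch. The DataFrames `df_a` and `df_b` and their columns are hypothetical and are not part of the datasets used later in this section.

```python
import pandas as pd

# Two small, hypothetical DataFrames that share an "id" column
df_a = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
df_b = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

# concat() takes a list of DataFrames and stacks them (row-wise by default)
stacked = pd.concat([df_a, df_b])

# merge() takes the two DataFrames as separate arguments and matches rows on a key
matched = pd.merge(df_a, df_b, on="id")

print(stacked.shape)  # (6, 3)
print(matched.shape)  # (2, 3)
```

Stacking keeps all six input rows, while merging keeps only the two rows whose `id` appears in both DataFrames.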

### Simple Joining with Concat Function <a class="anchor" id="section_1"></a>

The [concat()](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) function is used to combine two or more DataFrames. To demonstrate how it works, we will use the function to combine multiple toy DataFrames about popular sports tournaments such as the FIFA Soccer World Cup and the Rugby World Cup. Each dataset holds different pieces of information, such as the winning team, the host country, and the attendance size, as shown in the code below:

In [1]:
import pandas as pd

In [75]:
# FIFA World Cup Winning Teams
df_fifa_world_cup_winners = pd.DataFrame({'year': [2018,2014,2010,2006,2002,1998],
        'winner': ['France','Germany','Spain','Italy','Brazil','France'],
        'host_country': ['Russia','Brazil','South Africa',
        'Germany','South Korea','Japan']})

# Display DataFrame
df_fifa_world_cup_winners

Unnamed: 0,year,winner,host_country
0,2018,France,Russia
1,2014,Germany,Brazil
2,2010,Spain,South Africa
3,2006,Italy,Germany
4,2002,Brazil,South Korea
5,1998,France,Japan


In [77]:
# Rugby World Cup Winning Teams
df_rugby_world_cup_winners = pd.DataFrame({'year': [1999,2003,2007,2011,2015,2019],
                                           'winner': ['Australia','England','South Africa','New Zealand','New Zealand','South Africa'],
                                           'host_country': ['Wales','Australia','France','New Zealand','England','Japan'],
                                           'venue':['Millennium Stadium','Telstra Stadium','Stade de France','Eden Park','Twickenham','Nissan Stadium'],
                                           'attendance':[72500,82957,80430,61079,80125,70103]})

# Display DataFrame
df_rugby_world_cup_winners

Unnamed: 0,year,winner,host_country,venue,attendance
0,1999,Australia,Wales,Millennium Stadium,72500
1,2003,England,Australia,Telstra Stadium,82957
2,2007,South Africa,France,Stade de France,80430
3,2011,New Zealand,New Zealand,Eden Park,61079
4,2015,New Zealand,England,Twickenham,80125
5,2019,South Africa,Japan,Nissan Stadium,70103


Notice that the DataFrames above share some common information, such as the year of the event, the winning team name, and the host country. However, the Rugby World Cup dataset has two extra columns: venue and attendance.

Let's try to create a larger dataset with all winning FIFA and Rugby World Cup teams. The code below selects the common columns from each DataFrame and stacks the two DataFrames on top of each other.

In [78]:
# Join the 2 DataFrames using the concat() method
df_teams = pd.concat([df_fifa_world_cup_winners[['year', 'winner', 'host_country']],
                     df_rugby_world_cup_winners[['year', 'winner', 'host_country']]])

# Display the DataFrame
df_teams

Unnamed: 0,year,winner,host_country
0,2018,France,Russia
1,2014,Germany,Brazil
2,2010,Spain,South Africa
3,2006,Italy,Germany
4,2002,Brazil,South Korea
5,1998,France,Japan
0,1999,Australia,Wales
1,2003,England,Australia
2,2007,South Africa,France
3,2011,New Zealand,New Zealand
4,2015,New Zealand,England
5,2019,South Africa,Japan


We created a new DataFrame object called df_teams with 12 records from the two parent datasets. However, the resulting DataFrame raises some issues. 

First, it is no longer possible to tell whether a given row came from the original rugby or soccer dataset. Second, the new DataFrame object inherits the original index values from the parent datasets, so the index labels 0 to 5 appear twice. Both behaviours can be controlled by adjusting the [concat()](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) function parameters. The keys parameter can be used to track the data source by adding an extra outer index level to the new DataFrame, as shown in the example below. This feature allows us to query and access specific subsets of the DataFrame using the newly assigned index value.

In [79]:
# Add data source index values to the new DataFrame
df_teams = pd.concat([df_fifa_world_cup_winners[['year', 'winner', 'host_country']],
                     df_rugby_world_cup_winners[['year', 'winner', 'host_country']]], 
                     keys = ['soccer', 'rugby'])

# Display the DataFrame
df_teams

Unnamed: 0,Unnamed: 1,year,winner,host_country
soccer,0,2018,France,Russia
soccer,1,2014,Germany,Brazil
soccer,2,2010,Spain,South Africa
soccer,3,2006,Italy,Germany
soccer,4,2002,Brazil,South Korea
soccer,5,1998,France,Japan
rugby,0,1999,Australia,Wales
rugby,1,2003,England,Australia
rugby,2,2007,South Africa,France
rugby,3,2011,New Zealand,New Zealand
rugby,4,2015,New Zealand,England
rugby,5,2019,South Africa,Japan
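
To show how the extra index level can be queried, here is a minimal sketch. It uses shortened, hypothetical versions of the two tournament tables (`df_soccer` and `df_rugby`) so that it can run on its own.

```python
import pandas as pd

# Shortened, hypothetical versions of the two tournament tables
df_soccer = pd.DataFrame({"year": [2018, 2014], "winner": ["France", "Germany"]})
df_rugby = pd.DataFrame({"year": [2019, 2015], "winner": ["South Africa", "New Zealand"]})

# keys adds an outer index level that records the source of each row
df_teams = pd.concat([df_soccer, df_rugby], keys=["soccer", "rugby"])

# Select every row that came from the rugby table
rugby_rows = df_teams.loc["rugby"]

# Select a single value by (source, original index) pair
first_soccer_winner = df_teams.loc[("soccer", 0), "winner"]

print(rugby_rows)
print(first_soccer_winner)  # France
```

Selecting with `.loc["rugby"]` drops the outer level and returns the rugby rows with their original index values.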


In another scenario, we may prefer the new DataFrame to have entirely new index values. This can be achieved by setting the ignore_index parameter to True, as shown in the code below:

In [80]:
# Ignore old index values in the new DataFrame
df_teams = pd.concat([df_fifa_world_cup_winners[['year', 'winner', 'host_country']],
                     df_rugby_world_cup_winners[['year', 'winner', 'host_country']]], 
                     ignore_index = True)

# Display the DataFrame
df_teams

Unnamed: 0,year,winner,host_country
0,2018,France,Russia
1,2014,Germany,Brazil
2,2010,Spain,South Africa
3,2006,Italy,Germany
4,2002,Brazil,South Korea
5,1998,France,Japan
6,1999,Australia,Wales
7,2003,England,Australia
8,2007,South Africa,France
9,2011,New Zealand,New Zealand
10,2015,New Zealand,England
11,2019,South Africa,Japan
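
Resetting the index discards the information about where each row came from. One possible workaround, sketched below under the assumption that an ordinary column is an acceptable place to store the source, is to tag each DataFrame before concatenating. The `sport` column and the shortened tables are hypothetical.

```python
import pandas as pd

# Shortened, hypothetical versions of the two tournament tables
df_soccer = pd.DataFrame({"year": [2018, 2014], "winner": ["France", "Germany"]})
df_rugby = pd.DataFrame({"year": [2019, 2015], "winner": ["South Africa", "New Zealand"]})

# Tag each table with a "sport" column, then stack with a fresh 0..n-1 index
df_teams = pd.concat(
    [df_soccer.assign(sport="soccer"), df_rugby.assign(sport="rugby")],
    ignore_index=True,
)

print(df_teams)
```

The source can then be filtered like any other column, for example with `df_teams[df_teams["sport"] == "rugby"]`.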


The [concat()](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) function also allows us to combine multiple datasets even when they share few or no columns. The newly generated dataset will include all columns from the original DataFrames, with missing values filled with NaN, as shown in the example below.

In [81]:
# The new DataFrame will include all original columns
df_teams = pd.concat([df_fifa_world_cup_winners,
                     df_rugby_world_cup_winners])

# Display the DataFrame
df_teams

Unnamed: 0,year,winner,host_country,venue,attendance
0,2018,France,Russia,,
1,2014,Germany,Brazil,,
2,2010,Spain,South Africa,,
3,2006,Italy,Germany,,
4,2002,Brazil,South Korea,,
5,1998,France,Japan,,
0,1999,Australia,Wales,Millennium Stadium,72500.0
1,2003,England,Australia,Telstra Stadium,82957.0
2,2007,South Africa,France,Stade de France,80430.0
3,2011,New Zealand,New Zealand,Eden Park,61079.0
4,2015,New Zealand,England,Twickenham,80125.0
5,2019,South Africa,Japan,Nissan Stadium,70103.0
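
The union-of-columns behaviour corresponds to the default `join="outer"` setting of `concat()`. As a minimal sketch, the example below contrasts it with `join="inner"`, which keeps only the columns present in every input. The one-row tables are hypothetical stand-ins for the tournament data.

```python
import pandas as pd

# One-row, hypothetical stand-ins for the two tournament tables
df_soccer = pd.DataFrame({"year": [2018], "winner": ["France"]})
df_rugby = pd.DataFrame({"year": [2019], "winner": ["South Africa"], "attendance": [70103]})

# Default join="outer": union of columns, gaps filled with NaN
df_outer = pd.concat([df_soccer, df_rugby], ignore_index=True)

# join="inner": only columns present in every input are kept
df_inner = pd.concat([df_soccer, df_rugby], join="inner", ignore_index=True)

print(df_outer.columns.tolist())  # ['year', 'winner', 'attendance']
print(df_inner.columns.tolist())  # ['year', 'winner']
```

Because NaN is a floating-point value, the integer attendance column is converted to floats in the outer result, which is why the attendance figures above are displayed with a trailing `.0`.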


The examples above demonstrate how the Pandas [concat()](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) function can create new datasets by stacking DataFrame objects on top of each other (row axis). The function can also place the DataFrames side by side (column axis). This option is controlled by the axis parameter: 0 (the default) stacks rows, while 1 combines columns, as shown in the example below:

In [11]:
# The new DataFrame will include all original columns aligned horizontally
df_teams = pd.concat([df_fifa_world_cup_winners,
                     df_rugby_world_cup_winners], axis = 1)

# Display the DataFrame
df_teams

Unnamed: 0,year,winner,host_country,year.1,winner.1,host_country.1,venue,attendance
0,2018,France,Russia,1999,Australia,Wales,Millennium Stadium,72500
1,2014,Germany,Brazil,2003,England,Australia,Telstra Stadium,82957
2,2010,Spain,South Africa,2007,South Africa,France,Stade de France,80430
3,2006,Italy,Germany,2011,New Zealand,New Zealand,Eden Park,61079
4,2002,Brazil,South Korea,2015,New Zealand,England,Twickenham,80125
5,1998,France,Japan,2019,South Africa,Japan,Nissan Stadium,70103
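
Note that with axis = 1 the rows are paired by index label, not by position. The tables above line up neatly only because both use the index values 0 to 5. The minimal sketch below, using two hypothetical single-column DataFrames with partly different index labels, shows what happens otherwise.

```python
import pandas as pd

# Hypothetical frames whose index labels only partly overlap
df_left_part = pd.DataFrame({"winner": ["France", "Germany"]}, index=[0, 1])
df_right_part = pd.DataFrame({"venue": ["Eden Park", "Twickenham"]}, index=[1, 2])

# axis=1 places the frames side by side, matching rows by index label
df_wide = pd.concat([df_left_part, df_right_part], axis=1)

print(df_wide)
```

Only index label 1 exists in both inputs, so it is the only row with values in both columns. Labels 0 and 2 each get NaN in the column they are missing from.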


### Complex Joining with Merge Function <a class="anchor" id="section_2"></a>

Pandas [merge()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) provides the functionality to join DataFrame and Series objects in a way similar to relational database operations. Users who are familiar with joining tables in SQL but new to Pandas may find the comparison with SQL joins helpful. In this set of examples, we will demonstrate the use of the [merge()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function and highlight some of its important parameters.

The code below creates two DataFrame objects that share one column, key, and have four distinct columns between them: column_A and column_B in the left DataFrame, and column_C and column_D in the right one. The key column contains some values that appear in both DataFrames as well as values that appear in only one of them.

In [12]:
# Create left DataFrame
df_left = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3", "K4", "K5"],
    "column_A": ["A0", "A1", "A2", "A3", "A4", "A5"],
    "column_B": ["B0", "B1", "B2", "B3", "B4", "B5"]})

# Create right DataFrame
df_right = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3", "K6"],
    "column_C": ["C0", "C1", "C2", "C3", "C6"],
    "column_D": ["D0", "D1", "D2", "D3", "D6"]})

In [6]:
# Create departments dataset
df_departments = pd.DataFrame(
    {'department_id': ['D1', 'D2', 'D3', 'D4'],
     'department_name': ['IT', 'SALES', 'HR', 'R&D'],
     'department_location': ['location_1', 'location_1', 'location_2', 'location_2']})

# Create employees dataset
df_employees = pd.DataFrame(
    {'employee_name': ['Michael', 'Alice', 'Max', 'Janet', 'Ali'],
     'department_id': ['D1', 'D1', 'D2', 'D3', 'D6'],
     'salary': [500, 1000, 1500, 2000, 2500]})

In [7]:
df_employees

Unnamed: 0,employee_name,department_id,salary
0,Michael,D1,500
1,Alice,D1,1000
2,Max,D2,1500
3,Janet,D3,2000
4,Ali,D6,2500


In [8]:
df_departments

Unnamed: 0,department_id,department_name,department_location
0,D1,IT,location_1
1,D2,SALES,location_1
2,D3,HR,location_2
3,D4,R&D,location_2


In [14]:
# Merge the two DataFrames using the common column
pd.merge(df_employees, df_departments, on='department_id', 
         indicator = True, how = 'outer')

Unnamed: 0,employee_name,department_id,salary,department_name,department_location,_merge
0,Michael,D1,500.0,IT,location_1,both
1,Alice,D1,1000.0,IT,location_1,both
2,Max,D2,1500.0,SALES,location_1,both
3,Janet,D3,2000.0,HR,location_2,both
4,Ali,D6,2500.0,,,left_only
5,,D4,,R&D,location_2,right_only
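
The _merge column produced by indicator = True can be used to isolate the rows that failed to match. The following is a minimal sketch on shortened, hypothetical versions of the employees and departments tables. The variable names `df_all`, `unmatched_employees`, and `empty_departments` are illustrative only.

```python
import pandas as pd

# Shortened, hypothetical versions of the employees and departments tables
df_employees = pd.DataFrame({"employee_name": ["Michael", "Ali"], "department_id": ["D1", "D6"]})
df_departments = pd.DataFrame({"department_id": ["D1", "D4"], "department_name": ["IT", "R&D"]})

# Outer merge with a _merge column describing where each row was found
df_all = pd.merge(df_employees, df_departments, on="department_id", how="outer", indicator=True)

# Employees whose department_id has no match in the departments table
unmatched_employees = df_all.loc[df_all["_merge"] == "left_only", "employee_name"].tolist()

# Departments that have no employees assigned
empty_departments = df_all.loc[df_all["_merge"] == "right_only", "department_name"].tolist()

print(unmatched_employees)  # ['Ali']
print(empty_departments)    # ['R&D']
```

This kind of check is a simple way to spot keys that exist in only one of the two tables.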


In [13]:
df_left

Unnamed: 0,key,column_A,column_B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3
4,K4,A4,B4
5,K5,A5,B5


In [14]:
df_right

Unnamed: 0,key,column_C,column_D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3
4,K6,C6,D6


To join the two DataFrame objects using the [merge()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function, we can use a set of parameters to identify the common column, the joining type, and the source of each row.

* on: Identify the joining column when it has the same name in both DataFrames
* left_on: Identify the joining column in the left DataFrame
* right_on: Identify the joining column in the right DataFrame
* indicator: Add an extra column (named _merge) to the joined DataFrame to show the source of each row (left_only, right_only, or both)
* how: Identify the joining type, most commonly one of four options [inner, left, right, outer]; the default is inner.
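
The left_on and right_on parameters are not used in the examples of this section, so here is a minimal sketch of how they might be applied when the joining column is named differently in each DataFrame. The tables `df_staff` and `df_depts` and the column name `dept` are hypothetical.

```python
import pandas as pd

# Hypothetical tables where the key column has a different name on each side
df_staff = pd.DataFrame({"employee_name": ["Michael", "Max"], "dept": ["D1", "D2"]})
df_depts = pd.DataFrame({"department_id": ["D1", "D2"], "department_name": ["IT", "SALES"]})

# The key is called "dept" on the left and "department_id" on the right
df_joined = pd.merge(df_staff, df_depts,
                     left_on="dept", right_on="department_id", how="inner")

print(df_joined.columns.tolist())
# ['employee_name', 'dept', 'department_id', 'department_name']
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.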

The following code merges the two tables using the key column and the inner joining method, which is also the default. Notice that only four records are returned: those whose key values appear in both DataFrames [K0, K1, K2, K3].

In [15]:
# Merge the two DataFrames using the common column
df_results = pd.merge(df_left, df_right, on='key', 
                   how='inner', indicator=True)

# Display the DataFrame
df_results

Unnamed: 0,key,column_A,column_B,column_C,column_D,_merge
0,K0,A0,B0,C0,D0,both
1,K1,A1,B1,C1,D1,both
2,K2,A2,B2,C2,D2,both
3,K3,A3,B3,C3,D3,both


By changing the how parameter to 'outer', the joined DataFrame includes all records from both original DataFrames, with missing values filled with NaN, as shown in the example below:

In [16]:
# Merge the two DataFrames using the common column
df_results = pd.merge(df_left, df_right, on='key', 
                   how='outer', indicator=True)

# Display the DataFrame
df_results

Unnamed: 0,key,column_A,column_B,column_C,column_D,_merge
0,K0,A0,B0,C0,D0,both
1,K1,A1,B1,C1,D1,both
2,K2,A2,B2,C2,D2,both
3,K3,A3,B3,C3,D3,both
4,K4,A4,B4,,,left_only
5,K5,A5,B5,,,left_only
6,K6,,,C6,D6,right_only
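
The two remaining joining types, left and right, keep all keys from one side only. The following minimal sketch uses shortened, hypothetical versions of the left and right tables. The names `df_left_join` and `df_right_join` are illustrative.

```python
import pandas as pd

# Shortened, hypothetical versions of the left and right tables
df_left = pd.DataFrame({"key": ["K0", "K1", "K4"], "column_A": ["A0", "A1", "A4"]})
df_right = pd.DataFrame({"key": ["K0", "K1", "K6"], "column_C": ["C0", "C1", "C6"]})

# how="left" keeps every key from the left DataFrame
df_left_join = pd.merge(df_left, df_right, on="key", how="left")

# how="right" keeps every key from the right DataFrame
df_right_join = pd.merge(df_left, df_right, on="key", how="right")

print(df_left_join["key"].tolist())   # ['K0', 'K1', 'K4']
print(df_right_join["key"].tolist())  # ['K0', 'K1', 'K6']
```

In the left join, K4 has no partner, so its column_C value is NaN. In the right join, the same happens to column_A for K6.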
