# <span style="color:darkblue"> Today: Pandas Merge and Concat Methods</span>

<font size="5"> 

- You can complement this lesson with this Datacamp course: https://app.datacamp.com/learn/courses/joining-data-with-pandas

- The ```DataFrame.merge()``` method and the ```pd.concat()``` function are two ways to combine data frames (note that ```concat``` is a top-level Pandas function, not a ```DataFrame``` method)

- **Merge:** combines data on common column values

- **Concat:** stacks data frames along an axis (rows or columns), aligning them by column names or index labels
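As a rough illustration of the difference, here is a minimal sketch on two small made-up tables (`left`, `right`, and their values are hypothetical, not the course data). The merge keeps only rows whose `key` values match, while the concatenation simply stacks the rows and fills the cells it cannot align with `NaN`.

```python
import pandas as pd

# Hypothetical toy tables, not the course data
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": ["p", "q", "r"]})

# merge: match rows whose "key" values are equal (inner merge by default)
merged_demo = pd.merge(left, right, on="key")

# concat: stack the tables row-wise; columns are aligned by name
# and cells with no counterpart become NaN
stacked_demo = pd.concat([left, right], ignore_index=True)

print(merged_demo.shape)
print(stacked_demo.shape)
```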

## <span style="color:darkblue"> 1. The merge method</span>

<font size="5"> 

- Use ```merge``` when you want to combine data objects based on one or more keys --> **combine rows that share data**

- When using ```merge```, we need to provide two required arguments:

    - The **left** ```DataFrame```
    
    - The **right** ```DataFrame```

- Merge is a flexible method and consequently it has many options

- There is one option that is particularly relevant to understand: ```how``` --> **specifies the type of merge we want to execute**

![rdb_us](images/rdb_us_congress.png)

### **Types of merge**

<font size="5"> 

- There are four main types of merge (Pandas also accepts a fifth option, ```'cross'```):

    1. inner
    
    2. outer 

    3. left 

    4. right

![merges](images/merges.png)
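To make the four types concrete, the sketch below runs each one on two small made-up tables (the frames and column names are assumptions chosen for illustration) and records how many rows each merge returns.

```python
import pandas as pd

# Hypothetical toy tables: keys 2 and 3 appear in both,
# key 1 only on the left, key 4 only on the right
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    # Only the `how` argument changes between the four merges
    row_counts[how] = len(pd.merge(left, right, on="key", how=how))

print(row_counts)
```

The inner merge keeps only the shared keys, the outer merge keeps every key from both sides, and the left and right merges keep every key from one side only.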


## <span style="color:darkblue"> Sample of the U.S. Congress relational data </span>

![rdb_us](images/rdb_us_congress.png)

<font size="5"> 

- **Example 1:** we want to know what proportion of the bill actions were proposed by members of the Senate and what proportion by members of the House of Representatives

    1. Check how many unique ```member_id``` values there are in the **bills actions** table

    2. Check how many unique ```member_id``` values there are in the **congress member** table

    3. What type of merge should we use?
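The checks in the steps above could look roughly like this. The sketch uses small stand-in frames (`actions_demo` and `members_demo` are hypothetical, with made-up values) so that it runs on its own; with the course data you would apply the same calls to the loaded tables.

```python
import pandas as pd

# Hypothetical stand-ins for the bills actions and congress member tables
actions_demo = pd.DataFrame({
    "member_id": [10, 10, 20, 30],
    "action": ["a1", "a2", "a3", "a4"],
})
members_demo = pd.DataFrame({
    "member_id": [10, 20, 30, 40],
    "chamber": ["Senate", "House", "House", "Senate"],
})

# Steps 1 and 2: number of distinct member_id values in each table
n_action_ids = actions_demo["member_id"].nunique()
n_member_ids = members_demo["member_id"].nunique()

# Extra check: does every member_id in the actions table exist in the members table?
all_matched = bool(actions_demo["member_id"].isin(members_demo["member_id"]).all())

print(n_action_ids, n_member_ids, all_matched)
```

If every action has a matching member, an inner merge and a left merge (with the actions table on the left) return the same rows; if the members table has extra ids, a right or outer merge adds rows for members without actions.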

In [1]:
import pandas as pd

members = pd.read_csv('data/us_congress_member.csv')
actions = pd.read_csv('data/bills_actions.csv')

<font size="5"> 

- Let's try the inner merge first. [Here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) are the Pandas options:

    - **how:** {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, **default** ‘inner’

    - **on:** label or list

    - **left_on** and **right_on** in case the key columns have different names in the two tables
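A minimal sketch of ```left_on``` and ```right_on```, assuming a hypothetical members table whose key column is called `id` instead of `member_id` (both frames below are made up for illustration):

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
actions_demo = pd.DataFrame({"member_id": [1, 2], "action": ["offered", "proposed"]})
people_demo = pd.DataFrame({"id": [1, 2], "full_name": ["Member A", "Member B"]})

# Match actions_demo.member_id against people_demo.id
renamed_merge = pd.merge(
    actions_demo, people_demo,
    left_on="member_id", right_on="id", how="left",
)

# Both key columns are kept in the result
print(renamed_merge.columns.tolist())
```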

In [6]:
pd.merge(actions, members, on='member_id')

Unnamed: 0,congress,bill_number,bill_type,action,main_action,object,member_id,full_name,last_name,member_title,state,party_name,chamber
0,116,1029,s,S.Amdt.1274 Amendment SA 1274 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Mitch McConnell,McConnell,Senator,Kentucky,Republican,Senate
1,116,1160,s,S.Amdt.2659 Amendment SA 2659 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Mitch McConnell,McConnell,Senator,Kentucky,Republican,Senate
2,116,1309,s,S.Amdt.1275 Amendment SA 1275 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Mitch McConnell,McConnell,Senator,Kentucky,Republican,Senate
3,116,1434,s,S.Amdt.1269 Amendment SA 1269 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Mitch McConnell,McConnell,Senator,Kentucky,Republican,Senate
4,116,1636,s,S.Amdt.2707 Amendment SA 2707 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Mitch McConnell,McConnell,Senator,Kentucky,Republican,Senate
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3298,116,840,hr,POSTPONED PROCEEDINGS - At the conclusion of d...,other house amendment actions,amendment,485.0,Jack Bergman,Bergman,Representative,Michigan,Republican,House
3299,116,840,hr,H.Amdt.23 Amendment (A004) offered by Mr. Berg...,house amendment offered,amendment,485.0,Jack Bergman,Bergman,Representative,Michigan,Republican,House
3300,116,987,hr,H.Amdt.222 Amendment (A003) offered by Mr. Wel...,house amendment offered,amendment,922.0,Peter Welch,Welch,Representative,Vermont,Democratic,House
3301,116,986,hr,H.Amdt.203 Amendment (A006) offered by Mr. Hol...,house amendment offered,amendment,406.0,George Holding,Holding,Representative,North Carolina,Republican,House


<font size="5"> 

- Is something going to change if we use ```how='left'```?

In [7]:
pd.merge(actions, members, on='member_id', how='left')

Unnamed: 0,congress,bill_number,bill_type,action,main_action,object,member_id,full_name,last_name,member_title,state,party_name,chamber
0,116,1029,s,S.Amdt.1274 Amendment SA 1274 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Mitch McConnell,McConnell,Senator,Kentucky,Republican,Senate
1,116,1031,s,S.Amdt.2698 Amendment SA 2698 proposed by Sena...,senate amendment proposed (on the floor),amendment,675.0,Josh Hawley,Hawley,Senator,Missouri,Republican,Senate
2,116,1160,s,S.Amdt.2659 Amendment SA 2659 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Mitch McConnell,McConnell,Senator,Kentucky,Republican,Senate
3,116,1199,s,"Committee on Health, Education, Labor, and Pen...",senate committee/subcommittee actions,senate bill,1561.0,Lamar Alexander,Alexander,Senator,Tennessee,Republican,Senate
4,116,1208,s,Committee on the Judiciary. Reported by Senato...,senate committee/subcommittee actions,senate bill,1580.0,Lindsey Graham,Graham,Senator,South Carolina,Republican,Senate
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3298,116,9,hr,H.Amdt.172 Amendment (A004) offered by Ms. Kus...,house amendment offered,amendment,36.0,Ann M. Kuster,Kuster,Representative,New Hampshire,Democratic,House
3299,116,9,hr,H.Amdt.171 Amendment (A003) offered by Ms. Hou...,house amendment offered,amendment,186.0,Chrissy Houlahan,Houlahan,Representative,Pennsylvania,Democratic,House
3300,116,9,hr,H.Amdt.170 Amendment (A002) offered by Ms. Oma...,house amendment offered,amendment,477.0,Ilhan Omar,Omar,Representative,Minnesota,Democratic,House
3301,116,9,hr,POSTPONED PROCEEDINGS - At the conclusion of d...,other house amendment actions,amendment,393.0,"Frank, Jr. Pallone",Pallone,Representative,New Jersey,Democratic,House


<font size="5"> 

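The two results contain the same rows, only in a different order. Both have 3,302 rows (the index runs from 0 to 3301), so every action kept by the left merge also found a match in the members table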

<font size="5"> 

- How about if we use ```how='right'```?

In [9]:
pd.merge(actions, members, on='member_id', how='right')

Unnamed: 0,congress,bill_number,bill_type,action,main_action,object,member_id,full_name,last_name,member_title,state,party_name,chamber
0,,,,,,,0.0,A. Donald McEachin,McEachin,Representative,Virginia,Democratic,House
1,,,,,,,1.0,Aaron Schock,Schock,Representative,Illinois,Republican,House
2,116.0,3.0,hr,H.Amdt.719 Amendment (A009) offered by Ms. Fin...,house amendment offered,amendment,2.0,Abby Finkenauer,Finkenauer,Representative,Iowa,Democratic,House
3,116.0,7617.0,hr,H.Amdt.870 Amendment (A012) offered by Ms. Fin...,house amendment offered,amendment,2.0,Abby Finkenauer,Finkenauer,Representative,Iowa,Democratic,House
4,116.0,3055.0,hr,POSTPONED PROCEEDINGS - At the conclusion of d...,other house amendment actions,amendment,3.0,Abigail Davis Spanberger,Spanberger,Representative,Virginia,Democratic,House
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4728,116.0,3055.0,hr,H.Amdt.401 Amendment (A015) offered by Ms. Cla...,house amendment offered,amendment,1809.0,Yvette D. Clarke,Clarke,Representative,New York,Democratic,House
4729,116.0,3469.0,hr,Ms. Clarke (NY) moved to suspend the rules and...,house floor actions,house bill,1809.0,Yvette D. Clarke,Clarke,Representative,New York,Democratic,House
4730,116.0,4739.0,hr,Ms. Clarke (NY) moved to suspend the rules and...,house floor actions,house bill,1809.0,Yvette D. Clarke,Clarke,Representative,New York,Democratic,House
4731,116.0,4761.0,hr,Ms. Clarke (NY) moved to suspend the rules and...,house floor actions,house bill,1809.0,Yvette D. Clarke,Clarke,Representative,New York,Democratic,House


<font size="5"> 

The right merge has more rows (4,732; the index runs from 0 to 4731). It keeps every member, so the action columns are missing (```NaN```) for members who have no recorded actions
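One way to see where such missing values come from is the ```indicator=True``` option of ```pd.merge```, which adds a ```_merge``` column saying whether each row was found in both tables or in only one. The sketch below uses small hypothetical frames rather than the course data.

```python
import pandas as pd

# Hypothetical stand-ins: member 3 has no actions
actions_demo = pd.DataFrame({"member_id": [1, 1, 2], "action": ["a", "b", "c"]})
members_demo = pd.DataFrame({"member_id": [1, 2, 3], "full_name": ["A", "B", "C"]})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only", or "both"
checked = pd.merge(actions_demo, members_demo, on="member_id",
                   how="right", indicator=True)

n_both = int((checked["_merge"] == "both").sum())
n_right_only = int((checked["_merge"] == "right_only").sum())

print(n_both, n_right_only)
```

Rows marked ```right_only``` are members without any action, which is why their action columns are empty.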

<font size="5"> 

- Let's save the correct merged data into a new object and perform the calculation we need

- How can I calculate the proportion of legislative actions from Senators and House Representatives?

In [None]:
merged = pd.merge(actions, members, on='member_id', how='left')

In [None]:
# Count the actions by member title; summing a boolean Series counts the True values
sn = (merged['member_title'] == 'Senator').sum()
rn = (merged['member_title'] == 'Representative').sum()

In [None]:
sn/(sn+rn)

0.25615006150061503

In [None]:
rn/(sn+rn)

0.743849938499385
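An alternative sketch of the same calculation uses ```value_counts(normalize=True)```, which returns the share of each category in one call. The frame below is a hypothetical stand-in for the merged data. Note one assumption: this divides by all non-missing titles, so it matches the calculation above only if Senators and Representatives are the only titles present.

```python
import pandas as pd

# Hypothetical stand-in for the merged table
merged_demo = pd.DataFrame({
    "member_title": ["Senator", "Representative", "Representative", "Representative"],
})

# Share of rows per title (missing values are excluded by default)
shares = merged_demo["member_title"].value_counts(normalize=True)

print(shares)
```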

## <span style="color:darkblue"> 1.1. Merging on more than one key</span>

<font size="5"> 

- Instead of passing a single column name to the ```on``` option, we pass a list of column names

- **Example 2:** how many actions are associated with the subject ```'Congressional oversight'```?

![rdb_us](images/rdb_us_congress.png)

In [11]:
subjects = pd.read_csv('data/bills_subjects.csv')

In [12]:
mergeds = pd.merge(actions, subjects, on = ['congress', 'bill_number', 'bill_type'])

In [13]:
mergeds

Unnamed: 0,congress,bill_number,bill_type,action,main_action,object,member_id,bill_subject
0,116,1029,s,S.Amdt.1274 Amendment SA 1274 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Criminal procedure and sentencing
1,116,1029,s,S.Amdt.1274 Amendment SA 1274 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Evidence and witnesses
2,116,1029,s,S.Amdt.1274 Amendment SA 1274 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Judicial procedure and administration
3,116,1029,s,S.Amdt.1274 Amendment SA 1274 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Mammals
4,116,1029,s,S.Amdt.1274 Amendment SA 1274 proposed by Sena...,senate amendment proposed (on the floor),amendment,858.0,Service animals
...,...,...,...,...,...,...,...,...
153967,116,9,hr,H.Amdt.169 Amendment (A001) offered by Mr. Esp...,house amendment offered,amendment,6.0,"Presidents and presidential powers, Vice Presi..."
153968,116,9,hr,H.Amdt.169 Amendment (A001) offered by Mr. Esp...,house amendment offered,amendment,6.0,Rural conditions and development
153969,116,9,hr,H.Amdt.169 Amendment (A001) offered by Mr. Esp...,house amendment offered,amendment,6.0,State and local government operations
153970,116,9,hr,H.Amdt.169 Amendment (A001) offered by Mr. Esp...,house amendment offered,amendment,6.0,U.S. territories and protectorates


In [15]:
sum(mergeds['bill_subject'] == 'Congressional oversight')

1869
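The merged table is much longer than the actions table because a bill usually has several subjects, so each action is repeated once per subject of its bill. The sketch below reproduces this on made-up frames (all values are hypothetical). It assumes the subjects table lists each subject at most once per bill, in which case counting rows for one subject counts the actions on bills with that subject.

```python
import pandas as pd

# Hypothetical stand-ins: two actions on one bill that has three subjects
actions_demo = pd.DataFrame({
    "congress": [116, 116],
    "bill_number": [9, 9],
    "bill_type": ["hr", "hr"],
    "action": ["amendment offered", "postponed proceedings"],
})
subjects_demo = pd.DataFrame({
    "congress": [116, 116, 116],
    "bill_number": [9, 9, 9],
    "bill_type": ["hr", "hr", "hr"],
    "bill_subject": ["Congressional oversight",
                     "Rural conditions and development",
                     "Mammals"],
})

# A bill is identified by the combination of these three columns
keys = ["congress", "bill_number", "bill_type"]
merged_keys_demo = pd.merge(actions_demo, subjects_demo, on=keys)

n_rows = len(merged_keys_demo)  # every action paired with every subject of its bill
n_oversight = int((merged_keys_demo["bill_subject"] == "Congressional oversight").sum())

print(n_rows, n_oversight)
```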

## <span style="color:darkblue"> 2. The concat method</span>

<font size="5"> 

- With concatenation, the datasets are stacked along an axis: rows (```axis=0```, the default) or columns (```axis=1```)

![concat](images/concat.png)
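A minimal sketch of both directions, using made-up frames (`top`, `bottom`, and `side` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy tables
top = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom = pd.DataFrame({"a": [5, 6], "b": [7, 8]})
side = pd.DataFrame({"c": [9, 10], "d": [11, 12]})

# axis=0 (default): rows of `bottom` are appended below `top`;
# ignore_index=True renumbers the rows from 0
by_rows = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1: columns of `side` are placed next to `top`, aligned on the row index
by_cols = pd.concat([top, side], axis=1)

print(by_rows.shape, by_cols.shape)
```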

<font size="5"> 

- **Example 3:** concatenate the bills actions tables with and without ```member_id```
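Since the two tables are not shown in this section, here is only a sketch on hypothetical stand-ins (`with_id` and `without_id`, with made-up values). When one table lacks the ```member_id``` column, ```pd.concat``` keeps the column and fills it with ```NaN``` for the rows that came from that table.

```python
import pandas as pd

# Hypothetical stand-ins for the two bills actions tables
with_id = pd.DataFrame({"bill_number": [9, 840], "member_id": [6.0, 485.0]})
without_id = pd.DataFrame({"bill_number": [1029, 1160]})

# Row-wise concatenation; columns are aligned by name
all_actions_demo = pd.concat([with_id, without_id], ignore_index=True)

print(all_actions_demo)
```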