<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'> Joining Data in Pandas<br><br>
Tiago Ventura</font></center></h1>

---

**In this Notebook we cover**

This is our last notebook on data wrangling with `Pandas`. We will mainly cover: 

- Joining Methods in Pandas


## Joining Methods

It is unlikely that your work as a data scientist will be restricted to analyzing one isolated data frame (or table, in the `SQL`/database lingo). Most often you have multiple tables of data, and your work will consist of combining them to answer the questions that you’re interested in. 

There are two major reasons why complex datasets are often stored across multiple tables: 

- A) Integrity and efficiency concerns, often referred to as [database normalization](https://en.wikipedia.org/wiki/Database_normalization). As your data grow in size and complexity, keeping everything in one unified table leads to redundancy and possible data-entry errors. 

- B) Data come from different sources. As a researcher, you often get creative and augment the information at hand to answer a policy question. 

Database normalization works as a <span style='color:blue'> **constraint**</span>, a guardrail to protect your data infrastructure. The second reason why joining methods matter is primarily an <span style='color:red'> **opportunity** </span>. Always keep your eyes open for creative ways to connect data sources. Important research ideas often emerge from augmenting your data by joining datasets that were not originally designed to be related.

### `pandas` methods:

`pandas` ships with a full-featured function (`pd.merge`, also available as the `DataFrame.merge` method) to join data. We'll use `SQL` terminology (left, right, outer, and inner joins) when talking about joins, which keeps things consistent with `SQL` and the `R` Tidyverse. 

Let's start by creating two tables to play around with the `pandas` join methods.

In [36]:
# Two fake data frames
import pandas as pd
data_x = pd.DataFrame(dict(key = ["1","2","3"],
                           var_x = ["x1","x2","x3"]))
data_y = pd.DataFrame(dict(key = ["1","2","4"],
                           var_y = ["y1","y2","y4"]))
display(data_x)
display(data_y)

Unnamed: 0,key,var_x
0,1,x1
1,2,x2
2,3,x3


Unnamed: 0,key,var_y
0,1,y1
1,2,y2
2,4,y4


### Left Join: `pd.merge(<data>, how="left")`

- Keep all keys from the data set on the left

<br><br>

<div>
<img src="./figs/left_join.png" width="500"/>
</div>


In [14]:
# chaining datasets
data_x.merge(data_y,how="left") 

Unnamed: 0,key,var_x,var_y
0,1,x1,y1
1,2,x2,y2
2,3,x3,


In [15]:
# calling the pd.merge function directly (same result as the method above)
pd.merge(data_x, data_y, how="left")

Unnamed: 0,key,var_x,var_y
0,1,x1,y1
1,2,x2,y2
2,3,x3,
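The empty `var_y` cell in the last row is a missing value (`NaN`): key `"3"` exists in `data_x` but has no match in `data_y`. As a minimal sketch, the snippet below rebuilds the two toy frames and filters the left join for those unmatched rows. The variable names `left` and `unmatched` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Rebuild the two toy data frames used throughout this notebook
data_x = pd.DataFrame({"key": ["1", "2", "3"], "var_x": ["x1", "x2", "x3"]})
data_y = pd.DataFrame({"key": ["1", "2", "4"], "var_y": ["y1", "y2", "y4"]})

# Left join: every row of data_x is kept
left = data_x.merge(data_y, how="left", on="key")

# Rows whose key was not found in data_y carry NaN in var_y
unmatched = left[left["var_y"].isna()]
print(unmatched)
```

Checking for unmatched rows right after a left join is a quick way to see how much of the left table failed to link.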


### Right Join:  `pd.merge(<data>, how="right")`

- Keep all keys from the data set of the right

<br><br>

<div>
<img src="./figs/right_join.png" width="500"/>
</div>

In [16]:
# chaining datasets
data_x.merge(data_y,how="right") 

Unnamed: 0,key,var_x,var_y
0,1,x1,y1
1,2,x2,y2
2,4,,y4
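A right join is a left join with the two tables swapped; only the column order differs. The sketch below checks this on the toy data. The names `right` and `swapped` are illustrative.

```python
import pandas as pd

# Rebuild the two toy data frames used throughout this notebook
data_x = pd.DataFrame({"key": ["1", "2", "3"], "var_x": ["x1", "x2", "x3"]})
data_y = pd.DataFrame({"key": ["1", "2", "4"], "var_y": ["y1", "y2", "y4"]})

# Right join of data_x with data_y
right = data_x.merge(data_y, how="right", on="key")

# Left join with the roles swapped, then columns reordered to match
swapped = data_y.merge(data_x, how="left", on="key")[right.columns]

# Compare the two results
same_result = right.equals(swapped)
print(same_result)
```

In practice many people stick to left joins and simply choose which table goes first.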


### Full (outer) Join: `pd.merge(<data>, how="outer")`

- Keep all keys from left and right

<br><br>


<div>
<img src="./figs/full_join.png" width="500"/>
</div>

In [17]:
# chaining datasets
data_x.merge(data_y,how="outer") 

Unnamed: 0,key,var_x,var_y
0,1,x1,y1
1,2,x2,y2
2,3,x3,
3,4,,y4
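With a full join it is often useful to know where each row came from. One option, sketched here, is the `indicator=True` argument of `merge`, which adds a `_merge` column taking the values `both`, `left_only`, or `right_only`. The names `merged` and `merge_counts` are illustrative.

```python
import pandas as pd

# Rebuild the two toy data frames used throughout this notebook
data_x = pd.DataFrame({"key": ["1", "2", "3"], "var_x": ["x1", "x2", "x3"]})
data_y = pd.DataFrame({"key": ["1", "2", "4"], "var_y": ["y1", "y2", "y4"]})

# indicator=True adds a `_merge` column describing the origin of each row
merged = data_x.merge(data_y, how="outer", on="key", indicator=True)

# Count how many keys fall into each category
merge_counts = merged["_merge"].value_counts().to_dict()
print(merged)
print(merge_counts)
```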


### Inner Join: `pd.merge(<data>, how="inner")`

- Keep only matched keys

<br><br>

![](https://d33wubrfki0l68.cloudfront.net/3abea0b730526c3f053a3838953c35a0ccbe8980/7f29b/diagrams/join-inner.png)

In [18]:
# chaining datasets
data_x.merge(data_y,how="inner") 

Unnamed: 0,key,var_x,var_y
0,1,x1,y1
1,2,x2,y2
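None of the calls above name the join column. When `on` is omitted, `merge` joins on the columns the two frames have in common (here, `key`). The sketch below names the key explicitly and uses the `validate` argument to assert that keys are unique on both sides. `data_y_dup` is a hypothetical variant of `data_y` with a repeated key, built only to show the check failing.

```python
import pandas as pd

# Rebuild the two toy data frames used throughout this notebook
data_x = pd.DataFrame({"key": ["1", "2", "3"], "var_x": ["x1", "x2", "x3"]})
data_y = pd.DataFrame({"key": ["1", "2", "4"], "var_y": ["y1", "y2", "y4"]})

# Explicit key plus a uniqueness check on both sides
inner = pd.merge(data_x, data_y, how="inner", on="key", validate="one_to_one")

# Hypothetical variant of data_y in which key "1" appears twice
data_y_dup = pd.concat([data_y, data_y.iloc[[0]]], ignore_index=True)

try:
    pd.merge(data_x, data_y_dup, how="inner", on="key", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    # Raised because the right-hand keys are no longer unique
    duplicate_detected = True

print(inner)
print(duplicate_detected)
```

Being explicit about `on` protects against accidentally joining on an extra shared column, and `validate` catches duplicated keys that would otherwise silently multiply rows.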


### Handling disparate column names

In [35]:
# rename the key column so it differs across the two datasets
data_X = data_x.rename(columns={"key":"country_x"})
data_Y = data_y.rename(columns={"key":"country_y"})

# the frames no longer share a column name, so a plain merge would raise an error;
# left_on / right_on tell pandas which columns to match
pd.merge(data_X,
         data_Y,
         how="left",
         left_on="country_x",   # key column in the left data frame
         right_on="country_y")  # key column in the right data frame

Unnamed: 0,country_x,var_x,country_y,var_y
0,1,x1,1.0,y1
1,2,x2,2.0,y2
2,3,x3,,
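Note that the result keeps both key columns, and `country_y` is missing wherever there was no match. A possible cleanup, sketched below, is to drop the redundant right-hand key and give the remaining one a neutral name. The name `country` and the variables `merged` and `tidy` are illustrative choices.

```python
import pandas as pd

# Rebuild the two toy data frames and rename their key columns
data_x = pd.DataFrame({"key": ["1", "2", "3"], "var_x": ["x1", "x2", "x3"]})
data_y = pd.DataFrame({"key": ["1", "2", "4"], "var_y": ["y1", "y2", "y4"]})
data_X = data_x.rename(columns={"key": "country_x"})
data_Y = data_y.rename(columns={"key": "country_y"})

# Join on differently named key columns
merged = pd.merge(data_X, data_Y, how="left",
                  left_on="country_x", right_on="country_y")

# Drop the duplicated key and rename the one we keep
tidy = merged.drop(columns="country_y").rename(columns={"country_x": "country"})
print(tidy)
```

An alternative is to rename one of the key columns before merging so that a single `on=` argument can be used.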


### Concatenating by columns and rows


#### By Rows: `pd.concat(<>, axis=0)`
<br><br>
![](./figs/rbind.png)

In [37]:
# full of NAs because the non-key columns (var_x, var_y) do not share a name
pd.concat([data_x,data_y],
            sort=False) # do not sort the columns; keep their original order

Unnamed: 0,key,var_x,var_y
0,1,x1,
1,2,x2,
2,3,x3,
0,1,,y1
1,2,,y2
2,4,,y4


#### By Columns: `pd.concat(<>, axis=1)`

<br><br>
![](./figs/cbind.png)

In [44]:
pd.concat([data_x,data_y],axis=1)

Unnamed: 0,key,var_x,key.1,var_y
0,1,x1,1,y1
1,2,x2,2,y2
2,3,x3,4,y4
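Column binding pairs rows by their index, not by the values in `key`: in the last row above, key `"3"` from `data_x` sits next to key `"4"` from `data_y`. The sketch below shows this and, for contrast, what happens when `key` is first moved into the index so that `pd.concat` aligns on it. The names `side_by_side` and `aligned` are illustrative.

```python
import pandas as pd

# Rebuild the two toy data frames used throughout this notebook
data_x = pd.DataFrame({"key": ["1", "2", "3"], "var_x": ["x1", "x2", "x3"]})
data_y = pd.DataFrame({"key": ["1", "2", "4"], "var_y": ["y1", "y2", "y4"]})

# Rows are paired by index position 0, 1, 2, regardless of the key values
side_by_side = pd.concat([data_x, data_y], axis=1)

# With key as the index, concat aligns on it and keeps all keys by default
aligned = pd.concat([data_x.set_index("key"), data_y.set_index("key")], axis=1)
print(side_by_side)
print(aligned)
```

If the goal is to match rows by key, a merge is usually the clearer tool; column binding is best kept for frames already known to be in the same row order.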


Note that when we row-bind two `DataFrame` objects, `pandas` preserves each frame's original index, so index values repeat. You can use this to sort the combined dataset by index. 

In [45]:
pd.concat([data_x,data_y],axis=0).sort_index()

Unnamed: 0,key,var_x,var_y
0,1,x1,
0,1,,y1
1,2,x2,
1,2,,y2
2,3,x3,
2,4,,y4


To keep the data tidy, we can record which rows come from which dataset by generating a hierarchical index with the `keys` argument. Of course, this could also be done by creating an identifying column in each dataset before concatenating.

In [49]:
pd.concat([data_x,data_y],axis=0, keys=["data_x","data_y"])

Unnamed: 0,Unnamed: 1,key,var_x,var_y
data_x,0,1,x1,
data_x,1,2,x2,
data_x,2,3,x3,
data_y,0,1,,y1
data_y,1,2,,y2
data_y,2,4,,y4
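The column-based alternative mentioned above can be sketched as follows, assuming a new column called `source` (a hypothetical name) that labels each row with its dataset of origin before the frames are stacked.

```python
import pandas as pd

# Rebuild the two toy data frames used throughout this notebook
data_x = pd.DataFrame({"key": ["1", "2", "3"], "var_x": ["x1", "x2", "x3"]})
data_y = pd.DataFrame({"key": ["1", "2", "4"], "var_y": ["y1", "y2", "y4"]})

# Tag each frame with its origin, then stack them with a fresh index
stacked = pd.concat(
    [data_x.assign(source="data_x"), data_y.assign(source="data_y")],
    axis=0,
    ignore_index=True,
)
print(stacked)
```

A plain column is often easier to filter and group on than a hierarchical index, at the cost of one extra column in the data.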


Lastly, note that we can completely ignore the index if need be.

In [50]:
pd.concat([data_x,data_y],axis=0,ignore_index=True)

Unnamed: 0,key,var_x,var_y
0,1,x1,
1,2,x2,
2,3,x3,
3,1,,y1
4,2,,y2
5,4,,y4
