In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

# Joins

This section introduces *joins* in `pandas`, an operation for combining two or
more dataframes.

We'll continue looking at the baby names data. We'll use joins to check some
trends mentioned in the New York Times article about baby names
{cite}`williamsLilith2021`. The article talks about how certain categories of
names have become more or less popular over time. For instance, it mentions
that mythological names like Julius and Cassius have become popular, while baby
boomer names like Susan and Debbie have become less popular. How has the
popularity of these categories changed over time?

We've taken the names and categories in the NYT article and put them in a small
dataframe:

In [14]:
nyt = pd.read_csv('nyt_names.csv')
nyt

Unnamed: 0,nyt_name,category
0,Lucifer,forbidden
1,Lilith,forbidden
2,Danger,forbidden
...,...,...
20,Venus,celestial
21,Celestia,celestial
22,Skye,celestial


To see how popular the categories of names are, you need to join the
`nyt` dataframe with the `baby` dataframe since the `baby` table holds
the actual name counts.

In [3]:
baby = pd.read_csv('babynames.csv')
baby

Unnamed: 0,Name,Sex,Count,Year
0,Liam,M,19659,2020
1,Noah,M,18252,2020
2,Oliver,M,14147,2020
...,...,...,...,...
2020719,Verona,F,5,1880
2020720,Vertie,F,5,1880
2020721,Wilma,F,5,1880


Imagine going down each row in `baby` and asking, is this name in the `nyt`
table? If so, then add the value in the `category` column to the row. That's
the basic idea behind a join. Let's look at a few simpler examples first.
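The row-by-row idea above can be sketched as a plain Python loop. This is only an illustration and not how `pandas` implements joins. The two tiny tables are stand-ins built by hand from a few values shown in this section, and the names `baby_rows`, `nyt_rows`, `category_of`, and `loop_join` are made up for this example. The sketch also assumes each name appears only once in the categories table.

```python
import pandas as pd

# Hand-built stand-ins for the real tables (a few rows only)
baby_rows = pd.DataFrame({
    'Name': ['Noah', 'Julius', 'Karen'],
    'Count': [18252, 960, 325],
})
nyt_rows = pd.DataFrame({
    'nyt_name': ['Karen', 'Julius', 'Cassius'],
    'category': ['boomer', 'mythology', 'mythology'],
})

# Lookup from name to category (assumes each name appears once in nyt_rows)
category_of = dict(zip(nyt_rows['nyt_name'], nyt_rows['category']))

# Go down each row of baby_rows; keep it only if its name has a category
matched = []
for row in baby_rows.to_dict('records'):
    if row['Name'] in category_of:
        matched.append({**row, 'category': category_of[row['Name']]})

loop_join = pd.DataFrame(matched)
print(loop_join)
```

Rows whose name has no category (here, `Noah`) never make it into `loop_join`.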

## Basic Joins

Let's make smaller versions of the `baby` and `nyt` tables so it's easier to
see what happens when we join tables together.

In [48]:
nyt_small = nyt.iloc[[11, 12, 13]].reset_index(drop=True)
nyt_small

Unnamed: 0,nyt_name,category
0,Karen,boomer
1,Julius,mythology
2,Cassius,mythology


In [49]:
names_to_keep = ['Julius', 'Karen', 'Noah']
baby_small = (baby
 .query("Year == 2020 and Name in @names_to_keep")
 .reset_index(drop=True)
)
baby_small

Unnamed: 0,Name,Sex,Count,Year
0,Noah,M,18252,2020
1,Julius,M,960,2020
2,Karen,M,6,2020
3,Karen,F,325,2020
4,Noah,F,305,2020


To join tables in `pandas`, use the `.merge()` method:

In [50]:
baby_small.merge(nyt_small,
                 left_on='Name',        # column in left table to match
                 right_on='nyt_name')   # column in right table to match

Unnamed: 0,Name,Sex,Count,Year,nyt_name,category
0,Julius,M,960,2020,Julius,mythology
1,Karen,M,6,2020,Karen,boomer
2,Karen,F,325,2020,Karen,boomer


Notice that the new table has the columns of both the `baby_small` and
`nyt_small` tables. The rows with `Noah` are gone, and the remaining rows have
their matching `category` from `nyt_small`.

When we join two tables, we tell `pandas` which column(s) from each table to
use for the join (the `left_on` and `right_on` arguments). `pandas` matches
rows together when the values in the joining columns match, as shown in
{numref}`inner-join`.

```{figure} figures/inner-join.svg
---
name: inner-join
alt: inner-join
---
To join, `pandas` matches rows using the values in the `Name` and `nyt_name`
columns.
```
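One consequence of matching on values is that a single row in one table can pair with several rows in the other. As a hedged sketch, the code below rebuilds the two small tables by hand (so it runs without the CSV files) and shows that both `Karen` rows pick up the same category. The `validate='many_to_one'` argument is an optional `.merge()` check, not something the example above uses; it raises an error if a name appears more than once in the right table.

```python
import pandas as pd

# Hand-built copies of the small tables shown above
baby_small = pd.DataFrame({
    'Name': ['Noah', 'Julius', 'Karen', 'Karen', 'Noah'],
    'Sex': ['M', 'M', 'M', 'F', 'F'],
    'Count': [18252, 960, 6, 325, 305],
    'Year': [2020] * 5,
})
nyt_small = pd.DataFrame({
    'nyt_name': ['Karen', 'Julius', 'Cassius'],
    'category': ['boomer', 'mythology', 'mythology'],
})

# Many rows in baby_small may match one row in nyt_small;
# validate= makes pandas check that the right-hand keys are unique
joined = baby_small.merge(nyt_small,
                          left_on='Name',
                          right_on='nyt_name',
                          validate='many_to_one')

# Both Karen rows (M and F) are paired with the single Karen row in nyt_small
karen_rows = joined[joined['Name'] == 'Karen']
print(karen_rows)
```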

By default, `pandas` does an *inner join*. If either table has rows that don't
have matches in the other table, `pandas` drops those rows from the result. In
this case, the rows with `Noah` in `baby_small` don't have matches in
`nyt_small`, so they are dropped. Likewise, the row with `Cassius` in `nyt_small`
doesn't have a match in `baby_small`, so it is dropped as well. Only the rows
with a match stay in the final result.
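To check which rows an inner join drops, one option is to compare the key columns directly with `.isin()`. The sketch below rebuilds the two small tables by hand so that it runs on its own. The names `unmatched_baby` and `unmatched_nyt` are made up for this example. It also passes `how='inner'` explicitly to confirm that this gives the same result as the default.

```python
import pandas as pd

# Hand-built copies of the small tables shown above
baby_small = pd.DataFrame({
    'Name': ['Noah', 'Julius', 'Karen', 'Karen', 'Noah'],
    'Sex': ['M', 'M', 'M', 'F', 'F'],
    'Count': [18252, 960, 6, 325, 305],
    'Year': [2020] * 5,
})
nyt_small = pd.DataFrame({
    'nyt_name': ['Karen', 'Julius', 'Cassius'],
    'category': ['boomer', 'mythology', 'mythology'],
})

# The default merge and an explicit inner join
default = baby_small.merge(nyt_small, left_on='Name', right_on='nyt_name')
inner = baby_small.merge(nyt_small, left_on='Name', right_on='nyt_name',
                         how='inner')

# Rows in each table whose key has no match in the other table
unmatched_baby = baby_small[~baby_small['Name'].isin(nyt_small['nyt_name'])]
unmatched_nyt = nyt_small[~nyt_small['nyt_name'].isin(baby_small['Name'])]

print(unmatched_baby)   # the Noah rows
print(unmatched_nyt)    # the Cassius row
```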