# An Introduction to Combining Data with Pandas' `merge`, `join`, and `concat`

This notebook contains code examples to use with the article of the same name, along with light commentary on each of the examples. 

In [None]:
import pandas as pd

pd.set_option("display.max_columns", 50)

## Importing Data

Here you will import the temperature and precipitation climate normals datasets into DataFrames. Calling `.head()` on a DataFrame gives you a five-row preview of your data, and the `shape` attribute gives you its dimensions in the form `(rows, columns)`. Both are useful sanity checks to run before doing too much with the data.

In [None]:
climate_temp = pd.read_csv("climate_temp.csv")
climate_temp.head()

In [None]:
climate_temp.shape

In [None]:
climate_precip = pd.read_csv("climate_precip.csv")
climate_precip.head()

In [None]:
climate_precip.shape

## merge()

In this section, you will learn about `merge()` functionality in Pandas.

### Inner Join

Here, you will use a plain `merge()` call to do an inner join and learn how this can result in a smaller, more focused dataset. First, you will create a new DataFrame object that contains the precipitation data from one station.

In [None]:
precip_one_station = climate_precip.query("STATION == 'GHCND:USC00045721'")
precip_one_station.head()

In [None]:
precip_one_station.shape

In [None]:
inner_merged = pd.merge(precip_one_station, climate_temp)
inner_merged.head()

How many rows do you think this merged DataFrame has?

In [None]:
inner_merged.shape

You get 365 rows because an inner join, which is the default for a `merge()` call, discards any rows that have no match, and `precip_one_station` has only 365 rows.
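
The same effect can be reproduced at a smaller scale. The following is a minimal sketch that uses hypothetical toy DataFrames (`precip` and `temp`, with made-up values) rather than the climate files, so it runs without any CSV on disk:

```python
import pandas as pd

# Hypothetical toy data standing in for the climate datasets.
precip = pd.DataFrame(
    {
        "STATION": ["A", "A"],
        "DATE": [20100101, 20100102],
        "PRECIP": [0.1, 0.0],
    }
)
temp = pd.DataFrame(
    {
        "STATION": ["A", "A", "B"],
        "DATE": [20100101, 20100102, 20100101],
        "TEMP": [50.1, 51.3, 40.2],
    }
)

# With no `on` argument, merge() joins on every column name the two
# frames share (here STATION and DATE) and defaults to how="inner".
toy_inner = pd.merge(precip, temp)

print(toy_inner.shape)
print(toy_inner)
```

The row for station `B` has no partner in `precip`, so it is dropped and only the two rows for station `A` remain.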

What if you want to merge both full datasets, but specify which columns to join on? In this case, you will use the `on` parameter:

In [None]:
inner_merged_total = pd.merge(
    climate_temp, climate_precip, on=["STATION", "DATE"]
)
inner_merged_total.head()

In [None]:
inner_merged_total.shape

You can specify a single _key column_ with a string, or multiple key columns with a list, as in the above example. This results in a DataFrame with 123005 rows and 48 columns. 

Why 48 columns instead of 47? Because you specified the key columns to join on, Pandas doesn't try to merge all mergeable columns. This can result in "duplicate" column names, which may or may not have different values. "Duplicate" is in quotes because the columns will actually have new names: by default, `_x` and `_y` are appended to them. You can also use the `suffixes` parameter to control what is appended to the column names.
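
As a small illustration of the renaming, here is a sketch with two hypothetical toy DataFrames that share a non-key column named `FLAG`. The frame names, values, and the `_temp`/`_precip` suffixes are assumptions chosen for the example:

```python
import pandas as pd

# Hypothetical toy frames: FLAG exists in both but is not a join key.
left = pd.DataFrame({"STATION": ["A", "B"], "DATE": [1, 1], "FLAG": ["C", "S"]})
right = pd.DataFrame({"STATION": ["A", "B"], "DATE": [1, 1], "FLAG": ["P", "Q"]})

# Default behavior: the overlapping column gets _x and _y appended.
default_names = pd.merge(left, right, on=["STATION", "DATE"])

# The suffixes parameter replaces _x/_y with labels of your choosing.
custom_names = pd.merge(
    left, right, on=["STATION", "DATE"], suffixes=("_temp", "_precip")
)

print(list(default_names.columns))
print(list(custom_names.columns))
```

The first result has the columns `FLAG_x` and `FLAG_y`, while the second has `FLAG_temp` and `FLAG_precip`.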

### Outer Join
With the outer join, you will retain rows that don't have matches as well. For this example, you will use the smaller precipitation DataFrame `precip_one_station` with the full `climate_temp` DataFrame and join on the `STATION` and `DATE` columns. Take a second and think about how many rows you expect the new DataFrame to have.

In [None]:
outer_merged = pd.merge(
    precip_one_station, climate_temp, how="outer", on=["STATION", "DATE"]
)
outer_merged.head()

In [None]:
outer_merged.shape

If you remember from when you checked the `.shape` attribute of `climate_temp`, you'll see that the number of rows in `outer_merged` matches that. An outer join never loses rows the way an inner join does, so the result has at least as many rows as the larger DataFrame. Here the counts are equal because every key in `precip_one_station` also appears in `climate_temp`.
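
The row count equals that of the larger DataFrame here only because every key in the smaller frame also appears in the larger one. The sketch below uses hypothetical toy DataFrames (`small` and `large`, with made-up values) together with the `indicator` parameter of `merge()` to show where each row came from, and what happens when a key exists only in the smaller frame:

```python
import pandas as pd

# Hypothetical toy frames: station C exists only in `small`.
small = pd.DataFrame(
    {"STATION": ["A", "C"], "DATE": [1, 1], "PRECIP": [0.1, 0.3]}
)
large = pd.DataFrame(
    {"STATION": ["A", "A", "B"], "DATE": [1, 2, 1], "TEMP": [50.0, 51.0, 40.0]}
)

# indicator=True adds a _merge column that marks each row as
# "left_only", "right_only", or "both".
toy_outer = pd.merge(
    small, large, how="outer", on=["STATION", "DATE"], indicator=True
)

print(len(large), len(toy_outer))
print(toy_outer["_merge"].value_counts())
```

Because station `C` has no match in `large`, the outer join produces four rows, one more than the larger input.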

### Left Join
Also known as a left outer join. In this join, you keep every row of the left (or first) DataFrame, including rows that have no match in the right one. Rows of the right DataFrame that have no match are discarded.

In [None]:
left_merged = pd.merge(
    climate_temp, precip_one_station, how="left", on=["STATION", "DATE"]
)
left_merged.head()

In [None]:
left_merged.shape

Here, you see that the number of rows in the resulting DataFrame matches the number of rows in the `climate_temp` DataFrame. What if you switched the positions of the two DataFrames that you are merging?

In [None]:
left_merged_reversed = pd.merge(
    precip_one_station, climate_temp, how="left", on=["STATION", "DATE"]
)
left_merged_reversed.head()

In [None]:
left_merged_reversed.shape

### Right Join
This works the same way as the left join, except that every row of the _right_ DataFrame is kept and non-matching rows of the left DataFrame are discarded. In the next example, you will recreate the `left_merged` DataFrame but with a right join.

In [None]:
right_merged = pd.merge(
    precip_one_station, climate_temp, how="right", on=["STATION", "DATE"]
)
right_merged.head()

In [None]:
right_merged.shape

Here, you simply flipped the positions of the input DataFrames and specified a right join. When you inspect `right_merged`, you might notice that it's not exactly the same as `left_merged`. The only difference between the two is the order of the columns: the first input's columns will always be the first in the newly formed DataFrame.
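
To check that claim on something small, here is a sketch with hypothetical toy DataFrames (`small` and `large`, with made-up values). It reorders the columns of the right-join result to match the left-join result and then compares the two:

```python
import pandas as pd

# Hypothetical toy frames standing in for the one-station and full datasets.
small = pd.DataFrame({"STATION": ["A"], "DATE": [1], "PRECIP": [0.1]})
large = pd.DataFrame(
    {"STATION": ["A", "B"], "DATE": [1, 1], "TEMP": [50.0, 40.0]}
)

# Same pair of frames: a left join with `large` first,
# and a right join with `large` second.
toy_left = pd.merge(large, small, how="left", on=["STATION", "DATE"])
toy_right = pd.merge(small, large, how="right", on=["STATION", "DATE"])

print(list(toy_left.columns))
print(list(toy_right.columns))

# Put the right-join columns in the left-join order before comparing.
print(toy_left.equals(toy_right[toy_left.columns]))
```

In `toy_left` the `TEMP` column comes before `PRECIP`, while in `toy_right` the order is reversed, because the first input's columns always lead.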

## .join()
`.join()` uses `merge()` under the hood, but it provides a simpler interface and joins on indexes by default. Here is an introductory example using the `lsuffix` and `rsuffix` parameters to handle overlapping column names.

In [None]:
precip_one_station.join(climate_temp, lsuffix="_left", rsuffix="_right")

If you inspect the data, you'll see that overlapping columns are kept and simply renamed to be unique. If you flip this around and instead call `.join()` on the larger DataFrame, you'll notice that the result is larger, and that data that doesn't exist in the smaller DataFrame (`precip_one_station`) is filled in with `NaN` (_Not a Number_) values.

In [None]:
climate_temp.join(precip_one_station, lsuffix="_left", rsuffix="_right")

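If you want to use `.join()` to join on columns rather than on the existing indexes, you must first set the key columns of the DataFrame you pass in as its index. The calling DataFrame can name its own key columns with `on`. First, take a look at this previously used `merge()` operation: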

In [None]:
inner_merged_total = pd.merge(
    climate_temp, climate_precip, on=["STATION", "DATE"]
)
inner_merged_total.head()

In [None]:
inner_joined_total = climate_temp.join(
    climate_precip.set_index(["STATION", "DATE"]),
    on=["STATION", "DATE"],
    how="inner",
    lsuffix="_x",
    rsuffix="_y",
)
inner_joined_total.head()

Because `.join()` works on indexes, recreating the earlier `merge()` call requires the join columns to be the index of the DataFrame you pass in. In this example, you used the `.set_index()` method on `climate_precip` to turn the key columns into its index, and the `on` parameter to name the matching columns in `climate_temp`.
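
The equivalence can be checked on a small scale. This is a sketch with hypothetical toy DataFrames (`temp` and `precip`, with made-up values and no overlapping non-key columns, so no suffixes are needed):

```python
import pandas as pd

# Hypothetical toy frames with STATION and DATE as the shared keys.
temp = pd.DataFrame(
    {"STATION": ["A", "A", "B"], "DATE": [1, 2, 1], "TEMP": [50.0, 51.0, 40.0]}
)
precip = pd.DataFrame(
    {"STATION": ["A", "B", "C"], "DATE": [1, 1, 1], "PRECIP": [0.1, 0.2, 0.3]}
)

# Column-based inner join with merge().
merged = pd.merge(temp, precip, on=["STATION", "DATE"])

# The same join with .join(): the other frame's keys become its index,
# and `on` names the matching columns in the calling frame.
joined = temp.join(
    precip.set_index(["STATION", "DATE"]),
    on=["STATION", "DATE"],
    how="inner",
)

# Row labels may differ between the two results, so reset them
# before comparing the contents.
print(merged.equals(joined.reset_index(drop=True)))
```

Both calls keep only the key pairs found in both frames, so the comparison looks at the same two rows in each result.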

Below you will see an almost-bare `.join()` call. Because there are overlapping columns, you will have to specify a suffix with `lsuffix`, `rsuffix`, or both, but this example will demonstrate the more typical behavior of `.join()`.

In [None]:
climate_temp.join(climate_precip, lsuffix="_left")

## concat()

First, you will see a basic concatenation along axis 0.

In [None]:
double_precip = pd.concat([precip_one_station, precip_one_station])
double_precip.head()

To reset the index, use the `ignore_index` optional parameter.

In [None]:
reindexed = pd.concat(
    [precip_one_station, precip_one_station], ignore_index=True
)
reindexed.head()

When axis labels for the axis you are **not** concatenating along don't match (for example, column labels when concatenating along rows), then all columns are preserved and missing data is filled in with `NaN`. 
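
Before running this on the full datasets, it can help to see both join modes on something tiny. The sketch below uses hypothetical toy DataFrames (`temp` and `precip`, with made-up values) that share two columns and each have one column of their own:

```python
import pandas as pd

# Hypothetical toy frames: STATION and DATE are shared,
# TEMP and PRECIP each exist in only one frame.
temp = pd.DataFrame({"STATION": ["A"], "DATE": [1], "TEMP": [50.0]})
precip = pd.DataFrame({"STATION": ["A"], "DATE": [1], "PRECIP": [0.1]})

# Default join="outer": every column is kept and gaps become NaN.
toy_outer = pd.concat([temp, precip])

# join="inner": only the columns present in both frames survive.
toy_inner = pd.concat([temp, precip], join="inner")

print(toy_outer)
print(toy_inner)
```

`toy_outer` has four columns with one `NaN` in each row, while `toy_inner` keeps only `STATION` and `DATE`.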

In [None]:
outer_joined = pd.concat([climate_precip, climate_temp])
outer_joined.head()

In [None]:
inner_joined = pd.concat([climate_temp, climate_precip], join="inner")
inner_joined.head()

In [None]:
inner_joined.shape

To see how the same `join` logic applies to row labels, concatenate along columns instead:

In [None]:
inner_joined_cols = pd.concat(
    [climate_temp, climate_precip], axis="columns", join="inner"
)
inner_joined_cols.head()

In [None]:
inner_joined_cols.shape

You can also use the `keys` parameter to set hierarchical axis labels. These can be used, for example, to preserve the original labels while adding labels that tell you which dataset each row or column came from.
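
One practical use of those labels is pulling a single source back out of the combined DataFrame. Here is a sketch with hypothetical toy DataFrames (`temp` and `precip`, with made-up values):

```python
import pandas as pd

# Hypothetical toy frames with default integer row labels.
temp = pd.DataFrame({"STATION": ["A", "B"], "TEMP": [50.0, 40.0]})
precip = pd.DataFrame({"STATION": ["A"], "PRECIP": [0.1]})

# keys adds an outer index level naming the source of each row,
# while the original row labels are kept as the inner level.
toy_keys = pd.concat([temp, precip], keys=["temp", "precip"])
print(toy_keys.index.tolist())

# Selecting on the outer level returns only the rows from that source.
precip_rows = toy_keys.loc["precip"]
print(precip_rows)
```

The index now holds pairs such as `("temp", 0)` and `("precip", 0)`, so the repeated label `0` stays unambiguous.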

In [None]:
hierarchical_keys = pd.concat(
    [climate_temp, climate_precip], keys=["temp", "precip"]
)
hierarchical_keys.head()

In [None]:
hierarchical_keys.tail()